AITopics | vision module

vision module

A Modular Object Detection System for Humanoid Robots Using YOLO

Pottier, Nicolas, Lau, Meng Cheng

arXiv.org Artificial Intelligence | Oct-16-2025

Within the field of robotics, computer vision remains a significant barrier to progress, with many tasks hindered by inefficient vision systems. This research proposes a generalized vision module leveraging YOLOv9, a state-of-the-art framework optimized for computationally constrained environments like robots. The model is trained on a dataset tailored to the FIRA robotics Hurocup. A new vision module is implemented in ROS1 using a virtual environment to enable YOLO compatibility. Performance is evaluated using metrics such as frames per second (FPS) and Mean Average Precision (mAP). Performance is then compared to the existing geometric framework in static and dynamic contexts. The YOLO model achieved comparable precision at a higher computational cost than the geometric model, while providing improved robustness.
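The two evaluation metrics named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the helper names (`iou`, `average_precision`, `measure_fps`) are hypothetical, the average precision is the simple uninterpolated form (benchmark suites often use interpolated variants), and mAP would be the mean of this value over object classes.

```python
import time
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    scored_hits: Sequence[Tuple[float, bool]], num_ground_truth: int
) -> float:
    """Uninterpolated AP for one class.

    scored_hits holds (confidence, is_true_positive) per detection, where a
    detection counts as a true positive if its IoU with an unmatched ground
    truth box exceeds a chosen threshold (assumed to be decided upstream).
    """
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def measure_fps(process_frame: Callable[[object], object], frames: Sequence[object]) -> float:
    """Frames processed per second of wall-clock time."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

Timing the whole loop rather than individual frames averages out per-frame jitter, which matters when comparing a neural detector against a cheaper geometric pipeline.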

artificial intelligence, detection, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.13625

Country: Asia > Japan (0.28)

Genre: Research Report (0.83)

Industry:

Health & Medicine (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.88)

Learning Visually Grounded Domain Ontologies via Embodied Conversation and Explanation

Park, Jonghyuk, Lascarides, Alex, Ramamoorthy, Subramanian

arXiv.org Artificial Intelligence | Dec-12-2024

In this paper, we offer a learning framework in which the agent's knowledge gaps are overcome through corrective feedback from a teacher whenever the agent explains its (incorrect) predictions. We test it in a low-resource visual processing scenario, in which the agent must learn to recognize distinct types of toy truck. The agent starts the learning process with no ontology about what types of trucks exist nor which parts they have, and a deficient model for recognizing those parts from visual input. The teacher's feedback to the agent's explanations addresses its lack of relevant knowledge in the ontology via a generic rule (e.g., "dump trucks have dumpers"), whereas an inaccurate part recognition is corrected by a deictic statement (e.g., "this is not a dumper"). The learner utilizes this feedback not only to improve its estimate of the hypothesis space of possible domain ontologies and probability distributions over them, but also to use those estimates to update its visual interpretation of the scene. Our experiments demonstrate that teacher-learner pairs utilizing explanations and corrections are more data-efficient than those without such a faculty.

explanation, learner, truck, (16 more...)

arXiv.org Artificial Intelligence

2412.09770

Country:

Asia > Middle East > Jordan (0.04)
Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report (1.00)

Industry: Education (0.94)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)

Automatically Detecting Online Deceptive Patterns in Real-time

Nayak, Asmit, Zhang, Shirley, Wani, Yash, Khandelwal, Rishabh, Fawaz, Kassem

arXiv.org Artificial Intelligence | Nov-11-2024

Deceptive patterns (DPs) in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous across various digital platforms. While efforts to mitigate DPs have emerged from legal and technical perspectives, a significant gap remains in usable solutions that empower users to identify and make informed decisions about DPs in real time. In this work, we introduce AutoBot, an automated deceptive pattern detector that analyzes websites' visual appearances using machine learning techniques to identify and notify users of DPs in real time. AutoBot employs a two-stage pipeline that processes website screenshots, identifying interactable elements and extracting textual features without relying on HTML structure. By leveraging a custom language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We implement AutoBot as a lightweight Chrome browser extension that performs all analyses locally, minimizing latency and preserving user privacy. Through extensive evaluation, we demonstrate AutoBot's effectiveness in enhancing users' ability to navigate digital environments safely while providing a valuable tool for regulators to assess and enforce compliance with DP regulations.

large language model, machine learning, real time system, (22 more...)

arXiv.org Artificial Intelligence

2411.07441

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model

Li, Shunlei, Wang, Jin, Dai, Rui, Ma, Wanyu, Ng, Wing Yin, Hu, Yingbai, Li, Zheng

arXiv.org Artificial Intelligence | Sep-29-2024

In modern healthcare, the demand for autonomous robotic assistants has grown significantly, particularly in the operating room, where surgical tasks require precision and reliability. Robotic scrub nurses have emerged as a promising solution to improve efficiency and reduce human error during surgery. However, challenges remain in terms of accurately grasping and handing over surgical instruments, especially when dealing with complex or difficult objects in dynamic environments. In this work, we introduce a novel robotic scrub nurse system, RoboNurse-VLA, built on a Vision-Language-Action (VLA) model by integrating the Segment Anything Model 2 (SAM 2) and the Llama 2 language model. The proposed RoboNurse-VLA system enables highly precise grasping and handover of surgical instruments in real-time based on voice commands from the surgeon. Leveraging state-of-the-art vision and language models, the system can address key challenges for object detection, pose optimization, and the handling of complex and difficult-to-grasp instruments. Through extensive evaluations, RoboNurse-VLA demonstrates superior performance compared to existing models, achieving high success rates in surgical instrument handovers, even with unseen tools and challenging items. This work presents a significant step forward in autonomous surgical assistance, showcasing the potential of integrating VLA models for real-world medical applications. More details can be found at https://robonurse-vla.github.io.

instrument, robonurse-vla, surgical instrument, (14 more...)

arXiv.org Artificial Intelligence

2409.19590

Country:

Asia > China > Hong Kong (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)
Europe > Portugal > Coimbra > Coimbra (0.04)
Europe > Italy (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Surgery (1.00)
Health & Medicine > Health Care Providers & Services > Nursing (0.85)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Latent Object Characteristics Recognition with Visual to Haptic-Audio Cross-modal Transfer Learning

Saito, Namiko, Moura, Joao, Uchida, Hiroki, Vijayakumar, Sethu

arXiv.org Artificial Intelligence | Mar-15-2024

Recognising the characteristics of objects while a robot handles them is crucial for adjusting motions that ensure stable and efficient interactions with containers. Ahead of realising stable and efficient robot motions for handling/transferring the containers, this work aims to recognise the latent unobservable object characteristics. While vision is commonly used for object recognition by robots, it is ineffective for detecting hidden objects. However, recognising objects indirectly using other sensors is a challenging task. To address this challenge, we propose a cross-modal transfer learning approach from vision to haptic-audio. We initially train the model with vision, directly observing the target object. Subsequently, we transfer the latent space learned from vision to a second module, trained only with haptic-audio and motor data. This transfer learning framework facilitates the representation of object characteristics using indirect sensor data, thereby improving recognition accuracy. To evaluate the recognition accuracy of our proposed learning framework, we selected shape, position, and orientation as the object characteristics. Finally, we demonstrate online recognition of both trained and untrained objects using the humanoid robot Nextage Open.

module, orientation, recognition, (16 more...)

arXiv.org Artificial Intelligence

2403.10689

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.81)
Information Technology > Artificial Intelligence > Robots > Manipulation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Learning to See Physical Properties with Active Sensing Motor Policies

Margolis, Gabriel B., Fu, Xiang, Ji, Yandong, Agrawal, Pulkit

arXiv.org Artificial Intelligence | Nov-2-2023

In recent years, legged locomotion controllers have exhibited remarkable stability and control across a wide range of terrains such as pavement, grass, sand, ice, slopes, and stairs [1, 2, 3, 4, 5, 6, 7, 8]. State-of-the-art approaches using sim-to-real learning primarily rely on proprioception and depth sensing to perceive obstacles and terrain [5, 7, 8, 9, 10, 11, 12, 13, 14, 15]. These approaches discard valuable information about the terrain's material properties beyond geometry, such as slip, softness, etc., conveyed by color images. A primary reason for this choice is that sim-to-real transfer has been shown to work with depth images [5, 7, 10], but it remains unclear how well the transfer will work with color or RGB images. To utilize information beyond geometry, some works learn to predict task performance or task-relevant properties (e.g., traversability) from color images using data collected in the real world [16, 17, 18, 19, 20]. However, the terrain property predictors learned in prior works are task- or policy-specific, which limits their applicability to new tasks. To perceive a multipurpose representation of the terrain, we propose predicting the terrain's physical properties (e.g., friction, roughness) that are invariant to the policy and task.

active sensing motor policy, robot, terrain, (13 more...)

arXiv.org Artificial Intelligence

2311.01405

Country:

North America > United States > Massachusetts (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (1.00)

Industry: Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)
(2 more...)

Referential communication in heterogeneous communities of pre-trained visual deep networks

Mahaut, Matéo, Franzon, Francesca, Dessì, Roberto, Baroni, Marco

arXiv.org Artificial Intelligence | Jul-31-2023

As large pre-trained image-processing neural networks are being embedded in autonomous agents such as self-driving cars or robots, the question arises of how such systems can communicate with each other about the surrounding world, despite their different architectures and training regimes. As a first step in this direction, we systematically explore the task of referential communication in a community of heterogeneous state-of-the-art pre-trained visual networks, showing that they can develop, in a self-supervised way, a shared protocol to refer to a target object among a set of candidates. This shared protocol can also be used, to some extent, to communicate about previously unseen object categories of different granularity. Moreover, a visual network that was not initially part of an existing community can learn the community's protocol with remarkable ease. Finally, we study, both qualitatively and quantitatively, the properties of the emergent protocol, providing some evidence that it is capturing high-level semantic features of objects.

communication, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2302.08913

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > France (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(14 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
(3 more...)

Stereo Event-based Visual-Inertial Odometry

Wang, Kunfeng, Zhao, Kaichun, You, Zheng

arXiv.org Artificial Intelligence | Jul-25-2023

Event-based cameras are a new type of vision sensor whose pixels work independently and respond asynchronously to brightness changes with microsecond resolution, instead of providing standard intensity frames. Compared with traditional cameras, event-based cameras have low latency, no motion blur, and high dynamic range (HDR), which makes it possible for robots to handle some challenging scenes. We propose a visual-inertial odometry method for stereo event-based cameras based on an Error-State Kalman Filter (ESKF). The visual module updates the pose by aligning the edges of a semi-dense 3D map to a 2D image, and the IMU module updates the pose by median integral. We evaluate our method on public datasets with general 6-DoF motion and compare the results against ground truth. We show that our proposed pipeline provides improved accuracy over the state-of-the-art visual odometry for stereo event-based cameras, while running in real time on a standard CPU (low-resolution cameras). To the best of our knowledge, this is the first published visual-inertial odometry for stereo event-based cameras.
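The IMU propagation step mentioned in this abstract can be sketched as follows, assuming that "median integral" refers to midpoint integration (averaging two consecutive IMU samples over the interval). This is an illustrative sketch, not the paper's implementation: the function name `midpoint_step` is hypothetical, accelerations are assumed to be already rotated into the world frame with gravity and bias removed, and orientation propagation and the ESKF covariance update are omitted.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def midpoint_step(
    position: Vec3,
    velocity: Vec3,
    accel_prev: Vec3,
    accel_curr: Vec3,
    dt: float,
) -> Tuple[Vec3, Vec3]:
    """Propagate position and velocity over dt using the mean of two samples."""
    # Midpoint acceleration: average of the previous and current IMU readings.
    accel_mid = tuple(0.5 * (a0 + a1) for a0, a1 in zip(accel_prev, accel_curr))
    # Constant-acceleration kinematics over the interval, using accel_mid.
    new_position = tuple(
        p + v * dt + 0.5 * a * dt * dt
        for p, v, a in zip(position, velocity, accel_mid)
    )
    new_velocity = tuple(v + a * dt for v, a in zip(velocity, accel_mid))
    return new_position, new_velocity
```

Averaging the two samples reduces the discretization error of a plain Euler step that would use only the earlier reading.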

artificial intelligence, event camera, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2303.05086

Country:

North America > United States > Nevada > Clark County > Las Vegas (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.47)

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Berrios, William, Mittal, Gautam, Thrush, Tristan, Kiela, Douwe, Singh, Amanpreet

arXiv.org Artificial Intelligence | Jun-28-2023

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
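The modular idea in this abstract, where vision modules emit text and a language model reasons over it, can be sketched as a prompt-assembly step. This is a hypothetical illustration and not the API of the LENS repository: the function name `build_prompt`, the field labels, and the prompt layout are all assumptions, and the call to an actual LLM is left out.

```python
from typing import Sequence


def build_prompt(
    tags: Sequence[str],
    attributes: Sequence[str],
    captions: Sequence[str],
    question: str,
) -> str:
    """Combine textual outputs of independent vision modules into one LLM prompt."""
    lines = [
        "Tags: " + ", ".join(tags),
        "Attributes: " + ", ".join(attributes),
        "Captions: " + " | ".join(captions),
        "Question: " + question,
        "Answer:",
    ]
    return "\n".join(lines)
```

Because the language model only ever sees text, it can be swapped for any off-the-shelf model without multimodal training, which is the property the abstract highlights.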

arxiv preprint arxiv, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2306.16410

Country:

Asia > China (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Assembly Planning from Observations under Physical Constraints

Chabal, Thomas, Strudel, Robin, Arlaud, Etienne, Ponce, Jean, Schmid, Cordelia

arXiv.org Artificial Intelligence | Oct-25-2022

This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation. The proposed algorithm uses a simple combination of physical stability constraints, convex optimization and Monte Carlo tree search to plan assemblies as sequences of pick-and-place operations represented by STRIPS operators. It is efficient and, most importantly, robust to the errors in object detection and pose estimation unavoidable in any real robotic system. The proposed approach is demonstrated with thorough experiments on a UR5 manipulator.
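The pick-and-place representation named in this abstract can be illustrated with a minimal STRIPS operator in Python. The sketch covers only the symbolic layer: the predicate names (`on_table`, `clear`, `on`) and the `place_on` helper are hypothetical, and the paper's stability constraints, convex optimization, and Monte Carlo tree search are not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet

State = FrozenSet[str]  # a state is a set of ground facts that currently hold


@dataclass(frozen=True)
class Operator:
    """A STRIPS operator: preconditions, add list, and delete list."""

    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]

    def applicable(self, state: State) -> bool:
        # Every precondition must hold in the current state.
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable")
        # Remove deleted facts first, then add the new ones.
        return (state - self.delete_effects) | self.add_effects


def place_on(block: str, support: str) -> Operator:
    """Pick `block` from the table and place it on top of `support`."""
    return Operator(
        name=f"place({block},{support})",
        preconditions=frozenset(
            {f"on_table({block})", f"clear({block})", f"clear({support})"}
        ),
        add_effects=frozenset({f"on({block},{support})"}),
        delete_effects=frozenset({f"on_table({block})", f"clear({support})"}),
    )
```

A plan is then a sequence of such operators, each applicable in the state produced by the previous one; a search procedure would choose among the applicable operators at each step.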

artificial intelligence, assembly, planning & scheduling, (18 more...)

arXiv.org Artificial Intelligence

2204.09616

Country: North America > United States > New York (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.90)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.89)
